AlibabaCloud: fix for bootstrap teardown removing slb backends #5682
Conversation
/test e2e-alibaba

/test e2e-alibaba

@jsafrane The e2e-alibaba job runs and is able to install with this PR.
staebler left a comment
The bootstrap VM is not being removed from the load balancers' backends.
Maybe a virtual server group (alicloud_slb_server_group) could be used. Of course, this is not required.
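For illustration, a minimal sketch of that alternative, assuming hypothetical variable names and the API port 6443 (the PR kept plain backend-server attachments instead):

```hcl
# Hypothetical sketch of the alicloud_slb_server_group alternative.
# Variable names, group name, port, and weight are illustrative only.
variable "slb_id" {
  type = string
}

variable "master_ecs_ids" {
  type = list(string)
}

resource "alicloud_slb_server_group" "api" {
  load_balancer_id = var.slb_id
  name             = "controlplane-api"

  servers {
    server_ids = var.master_ecs_ids
    port       = 6443
    weight     = 100
  }
}
```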
@kwoodson Have you located the cause of the issue? Can you briefly describe it? Thanks. I checked resourceAliyunSlbBackendServersDelete but found no problem.
@bd233 The issue is that when the teardown step of the bootstrap happens, all of the servers are removed from the SLB, which causes the cluster installation to fail. By adding this extra step to the installation, all nodes are successfully added, and the bootstrap is removed once the instance is released.
This may have been overlooked in the message from my first review, but no code runs for this stage during bootstrap destroy to remove the bootstrap VM from the backends of the SLB.
@staebler I'm sorry for the late response. Just returning from PTO.
When I tested this code previously, the bootstrap release automatically removes the instance from the SLB list of backends. I'll test this morning and re-verify that behavior.
Is this handled implicitly by the bootstrap VM being deleted?
yes
Do you think there is any value in explicitly removing the bootstrap VM from the load balancer? If we are going to rely on implicit behavior, then we at least need some good comments explaining why the bootstrap VM does not need to be explicitly removed from the load balancer. A year from now, I am going to look at this code and not remember why we are not removing from the load balancer: I am going to think it was an oversight.
So the effective difference is that the master backends are created after the bootstrap backends, right? If that works, I think that is a cleaner approach. I would like to see some comments on the stage explaining why the master backends need to be added after the bootstrap backends.
I also think that this will be a bit easier to revert once the bug is fixed in the terraform provider.
Running into the same issue now. Wondering if there is a race condition upon running the next stage. I've had a successful cluster install and a failure. The bootstrap destroy ran but the slb_attachment stage never ran.
DEBUG Plan: 0 to add, 0 to change, 11 to destroy.
...
DEBUG Destroy complete! Resources: 11 destroyed.
<-- loadbalancer backends were removed here -->
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Networking...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
DEBUG Using Install Config loaded from state file
INFO Waiting up to 40m0s (until 3:20PM) for the cluster at https://api.test.alicloud-dev.devcluster.openshift.com:6443 to initialize...
W0316 14:40:59.738715 1143968 reflector.go:324] k8s.io/client-go/tools/watch/informerwatcher.go:146: failed to list *v1.ClusterVersion: Get "https://api.test.alicloud-dev.devcluster.openshift.com:6443/apis/config.openshift.io/v1/clusterversions?fieldSelector=metadata.name%3Dversion&limit=500&resourceVersion=0": dial tcp 47.253.132.94:6443: connect: connection refused
@staebler Any ideas why the next stage did not proceed?
Do you have a full log that you can show me? I suspect that you are not running the right binary if the "slb_attach_controlplane" stage did not run.
After some troubleshooting in cluster.go and watching the stages, I can see that the stages execute sequentially. The cluster, bootstrap, and then the attachment run successfully. Then the teardown occurs for the bootstrap which removes the backend servers from the SLB.
Is there a way to defer the attachment until the teardown of the bootstrap is complete?
You don't want to defer the attachment. You want the masters attached while bootstrapping is proceeding.
Force-pushed fa07d30 to 34c1ab1
Force-pushed 10d9bab to 6af2230
Force-pushed 3980db9 to 5d9ab1f
/test e2e-alibaba

@staebler I think this was in line with what you were suggesting previously. This mirrors GCP's functionality and is variable driven. This was successful for cluster creation. PTAL
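Roughly, the variable-driven approach can be sketched as follows. This is a sketch, not the PR's exact code: `master_ecs_ids` and `bootstrap_ecs_id` are hypothetical variable names, while `ali_bootstrap_lb` follows the flag name referenced in the stage comments further below.

```hcl
# Sketch: a flag controls whether the bootstrap VM is included in the
# backend list. On create the flag is true, so the bootstrap is attached;
# bootstrap destroy re-applies the stage with the flag set to false,
# leaving only the masters attached.
variable "ali_bootstrap_lb" {
  type    = bool
  default = true
}

# Hypothetical inputs, not the PR's actual variable names.
variable "master_ecs_ids" {
  type = list(string)
}

variable "bootstrap_ecs_id" {
  type = string
}

locals {
  backend_server_ids = var.ali_bootstrap_lb ? concat(var.master_ecs_ids, [var.bootstrap_ecs_id]) : var.master_ecs_ids
}
```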
staebler left a comment
I think this was in line with what you were suggesting previously.
Indeed!
Note that this will cause a merge conflict with #5715.
-  load_balancer_id = "${element(var.slb_ids, count.index)}"
+  load_balancer_id = var.slb_ids[count.index]
You don't need the quotes and dollar sign noise any more.
-  count = "${length(var.slb_ids)}"
+  count = length(var.slb_ids)
-  server_id = "${backend_servers.value}"
+  server_id = backend_servers.value
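Taken together, the cleaned-up suggestions point at an attachment resource roughly like the following. This is a sketch rather than the PR's exact code: the resource label and the weight value are assumptions, and `local.backend_server_ids` refers to the bootstrap-aware list sketched earlier.

```hcl
# Sketch: one backend-server attachment per SLB, fanned out with count.
# The dynamic block registers every VM in local.backend_server_ids as a
# backend of the load balancer.
resource "alicloud_slb_backend_server" "slb_attachment" {
  count            = length(var.slb_ids)
  load_balancer_id = var.slb_ids[count.index]

  dynamic "backend_servers" {
    for_each = local.backend_server_ids
    content {
      server_id = backend_servers.value
      weight    = 90 # assumed weight; not specified in the review
    }
  }
}
```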
Put this in data/data/alibabacloud/variables-alibabacloud.tf instead.
Add a newline.
pkg/asset/cluster/cluster.go
Remove this superfluous change.
| "github.com/hashicorp/terraform-exec/tfexec" | |
| "github.com/openshift/installer/pkg/terraform" | |
| "github.com/openshift/installer/pkg/terraform/providers" | |
| "github.com/openshift/installer/pkg/terraform/stages" | |
| alitypes "github.com/openshift/installer/pkg/types/alibabacloud" | |
| "github.com/pkg/errors" | |
| "github.com/hashicorp/terraform-exec/tfexec" | |
| "github.com/pkg/errors" | |
| "github.com/openshift/installer/pkg/terraform" | |
| "github.com/openshift/installer/pkg/terraform/providers" | |
| "github.com/openshift/installer/pkg/terraform/stages" | |
| alitypes "github.com/openshift/installer/pkg/types/alibabacloud" |
-// all of the controlplane backends to be removed from the SLB. This stage reattaches only the
-// master controlplane nodes by calling apply with the ali_boostrap_lb set to false.
+// all of the controlplane backends to be removed from the SLB. This stage attaches the
+// master VMs and the bootstrap VMs as backend servers to the SLBs on create. Later,
+// on bootstrap destroy, this stage removes only the bootstrap VM from the backend servers.
staebler left a comment
One small nit. Otherwise it looks good.
This description is not correct.
-  description = "The load balancer IDs of the bootstrap ECS."
+  description = "The IDs of the load balancers to which to attach the bootstrap and control plane VMs."
Latest changes are successful. I'll update one more time to address the nit.
staebler left a comment
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest-required

Please review the full test history for this PR and help us cut down flakes.

(8 similar comments)
I'm not sure what this job has to do with okd-e2e-aws-upgrade. It might be good to be able to test only the areas touched by this Alibaba work. Let's see what happens.
/retest-required

Please review the full test history for this PR and help us cut down flakes.

(4 similar comments)
@kwoodson: The following tests failed.

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The AlibabaCloud provider has an issue with the recent Terraform provider updates: when the bootstrap is torn down, the SLB backend servers are removed. This appears to be a bug in the Terraform provider.

This change removes the SLB teardown for the bootstrap and adds an additional stage that attaches all of the control plane nodes, including the bootstrap, to the SLB (Server Load Balancer). When the bootstrap is torn down, the instance is automatically removed from the SLB.